14.3 XML Processing

Here are the RSS feeds of some news organizations.

We will explore reading, cleaning and manipulating the data for the following tasks.

  1. Figure out how many stories there are in each RSS feed.
  2. What percent of the stories relate to the president?
  3. Think of a good way to organize all of the news data for further processing.
  4. Visualize the focus of the stories in a simple way by creating a wordcloud for each of the three sites.

14.3.1. Understand the XML structure

Let us try to understand what the RSS feed data actually looks like. You might think that by just visiting the RSS feed site, you would be able to discern the structure. Unfortunately, this is not straightforward, since most RSS feed sites detect that a browser is being used and automatically present the contents as a web page. Furthermore, browsers may block some style elements for safety reasons and prevent the raw XML from loading properly. So some manual editing is sometimes needed to view the structure. For example, for the BBC and CNN sites below, I had to delete the xml-stylesheet element to render the feed properly in my browser.

There are nice XML viewer plugins for browsers. One is called XV; you can configure its options to intercept any RSS feeds in the browser. Then, if you open the CNN URL in a browser, the structure will be shown as in the picture above.

Later, when we talk about XPath, you will find tools like Selector Gadget and XPath Helper very useful for inspecting the page content and narrowing to parts of your XML document in the browser.

14.3.2. The Actual XML Structure

So here are pictures with the structure for each site.

CNN RSS Structure

NYTimes RSS Structure

BBC RSS Structure

XPath Expressions

A quick examination shows that the stories are marked up using <item> tags. And within an item, the headline is marked up with <title> and the summary with <description>.

For our exercise, we are only interested in the parts of the XML tree containing these tags; the rest of the data is to be ignored. This means we need a way to match the part of the tree of interest to us. We do this in a fashion analogous to how we locate files or folders on a computer.

To specify the iTunes subfolder of your Music folder, we can use the construct Music/iTunes (called a path expression), where the / is used as a separator. The home directory for a user joe on a Mac, for example, is /Users/joe. The latter is an unambiguous construct since it begins with a /, specifying the root of the directory structure. Continuing, the unambiguous path for user joe’s iTunes directory would be /Users/joe/Music/iTunes.

The analogs for titles and descriptions are item/title and item/description, and these are called, naturally enough, XPath expressions. So XPath allows us to search for and target information in XML documents using XPath expressions.

The precise XPath constructs, however, are /rss/channel/item/title and /rss/channel/item/description, specifying the full pedigree of the tags. The starting slash indicates that we want to start from the root of the tree.

A lazier XPath construct would be //item/title for titles and //item/description for descriptions. The double slash indicates that we are not specifying what may occur before the match: any number of intermediate levels is allowed. Thus /foo/item/title would be a match, as would /bar/item/title. The //item/title shortcut therefore matches every title tag that is a child of an item tag, no matter how deep or where in the tree it occurs. (Note that this may not be what one desires in every situation, but it certainly suffices in our case, since the title tag is always a child of an item tag.)

Relative XPaths are also useful. Often, an XPath expression selects a set of nodes that match a pattern, and all further processing relates to the children of these selected nodes. Then relative paths, such as ./title or the less specific .//title, can be used once you have selected all the item nodes in the XML document.
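These ideas can be sketched with xml2 on a tiny, made-up RSS fragment (the fragment and its contents are invented for illustration):

```r
library(xml2)

## A tiny, made-up RSS fragment
doc <- read_xml('
<rss><channel>
  <item><title>Story one</title><description>First</description></item>
  <item><title>Story two</title><description>Second</description></item>
</channel></rss>')

## Absolute XPath: the full pedigree from the root
xml_find_all(doc, "/rss/channel/item/title")

## Lazy XPath: match title children of item tags anywhere in the tree
xml_find_all(doc, "//item/title")

## Relative XPath: first select the item nodes, then search within one
items <- xml_find_all(doc, "//item")
xml_find_first(items[[1]], "./title")
```

Both the absolute and the lazy searches return the same two title nodes here, since every title is a child of an item.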

Armed with this, we can proceed.

14.3.3. How many stories?

We will use a library called xml2 to handle the processing of the RSS XML data. So we load the library and call the read_xml function from xml2 to read the data, starting with CNN:

library(xml2)
cnn_url <- 'http://rss.cnn.com/rss/cnn_topstories.rss'
cnn_xml <- read_xml(x = cnn_url)

We parse out the titles and descriptions by passing the XPath expressions to the xml_find_all function.

title_xpath <- "/rss/channel/item/title"
description_xpath <- "/rss/channel/item/description"
cnn_titles <- xml_find_all(x = cnn_xml, xpath = title_xpath)
print(cnn_titles)
## {xml_nodeset (69)}
##  [1] <title><![CDATA[With a five-month impeachment saga behind them, sources  ...
##  [2] <title><![CDATA[Schiff calls Bolton's decision inexplicable]]></title>
##  [3] <title><![CDATA[House managers say Trump hasn't learned a lesson from im ...
##  [4] <title><![CDATA[Opinion: Trump's disturbing 'celebration']]></title>
##  [5] <title><![CDATA[Pro-Trump groups flood social media with videos like thi ...
##  [6] <title><![CDATA[Analysis: The hidden worst part of Trump's unhinged impe ...
##  [7] <title><![CDATA[Exclusive photos of Giuliani in Spain show Lev Parnas ha ...
##  [8] <title><![CDATA[Appeals court tosses Democrats' emoluments suit against  ...
##  [9] <title><![CDATA[CNN fact checks Trump claim about Ivanka during speech]] ...
## [10] <title><![CDATA[CNN analysis shows errors in Iowa count ]]></title>
## [11] <title><![CDATA[Analysis: Democrats could have a contested convention]]> ...
## [12] <title><![CDATA[How Pete Buttigieg rose to the top ]]></title>
## [13] <title><![CDATA[What to watch in Friday night's debate]]></title>
## [14] <title><![CDATA[Key impeachment witness expects to leave White House pos ...
## [15] <title><![CDATA[State to end a holiday for Robert E. Lee. Election Day w ...
## [16] <title><![CDATA[Harvey Weinstein's lawyer says she's never been sexually ...
## [17] <title><![CDATA[Coronavirus cases skyrocket, with more than 31,000 cases ...
## [18] <title><![CDATA[The couple accused of drugging and raping women are in a ...
## [19] <title><![CDATA[Teen killed while rapping on Facebook Live ]]></title>
## [20] <title><![CDATA[Ford shakes up its management]]></title>
## ...

Ok, we’ve narrowed down the nodes, but the printed items still contain the markup tags like <title> and <description>. The function xml_contents will extract the contents and remove the enclosing tags for us.

cnn_titles <- xml_contents(x = cnn_titles)
print(cnn_titles)
## {xml_nodeset (69)}
##  [1] <![CDATA[With a five-month impeachment saga behind them, sources say act ...
##  [2] <![CDATA[Schiff calls Bolton's decision inexplicable]]>
##  [3] <![CDATA[House managers say Trump hasn't learned a lesson from impeachme ...
##  [4] <![CDATA[Opinion: Trump's disturbing 'celebration']]>
##  [5] <![CDATA[Pro-Trump groups flood social media with videos like this ]]>
##  [6] <![CDATA[Analysis: The hidden worst part of Trump's unhinged impeachment ...
##  [7] <![CDATA[Exclusive photos of Giuliani in Spain show Lev Parnas has lots  ...
##  [8] <![CDATA[Appeals court tosses Democrats' emoluments suit against Trump]]>
##  [9] <![CDATA[CNN fact checks Trump claim about Ivanka during speech]]>
## [10] <![CDATA[CNN analysis shows errors in Iowa count ]]>
## [11] <![CDATA[Analysis: Democrats could have a contested convention]]>
## [12] <![CDATA[How Pete Buttigieg rose to the top ]]>
## [13] <![CDATA[What to watch in Friday night's debate]]>
## [14] <![CDATA[Key impeachment witness expects to leave White House post, sour ...
## [15] <![CDATA[State to end a holiday for Robert E. Lee. Election Day will be  ...
## [16] <![CDATA[Harvey Weinstein's lawyer says she's never been sexually assaul ...
## [17] <![CDATA[Coronavirus cases skyrocket, with more than 31,000 cases worldw ...
## [18] <![CDATA[The couple accused of drugging and raping women are in a state  ...
## [19] <![CDATA[Teen killed while rapping on Facebook Live ]]>
## [20] <![CDATA[Ford shakes up its management]]>
## ...

Hmmm, better, but not there yet. We need to strip the CDATA sections. (What are these, you may ask? CDATA stands for character data, that is, text that the XML parser leaves uninterpreted; for example, it can be used to stuff XML inside XML! Here they are used to hold arbitrary text which may contain characters such as <, and so it is safest to use CDATA.) Enter xml_text.
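As a tiny illustration of why CDATA is useful, consider a made-up title whose text itself contains markup-like characters; xml_text returns the raw character data untouched:

```r
library(xml2)

## A made-up title containing < and > characters inside CDATA
node <- read_xml("<title><![CDATA[Stocks rise, <em>analysts</em> say]]></title>")
xml_text(node)  ## "Stocks rise, <em>analysts</em> say"
```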

cnn_titles <- xml_text(x = cnn_titles)
print(cnn_titles)
##  [1] "With a five-month impeachment saga behind them, sources say acting chief of staff Mick Mulvaney's future is now in question"
##  [2] "Schiff calls Bolton's decision inexplicable"                                                                                
##  [3] "House managers say Trump hasn't learned a lesson from impeachment trial"                                                    
##  [4] "Opinion: Trump's disturbing 'celebration'"                                                                                  
##  [5] "Pro-Trump groups flood social media with videos like this "                                                                 
##  [6] "Analysis: The hidden worst part of Trump's unhinged impeachment victory speech"                                             
##  [7] "Exclusive photos of Giuliani in Spain show Lev Parnas has lots more to share"                                               
##  [8] "Appeals court tosses Democrats' emoluments suit against Trump"                                                              
##  [9] "CNN fact checks Trump claim about Ivanka during speech"                                                                     
## [10] "CNN analysis shows errors in Iowa count "                                                                                   
## [11] "Analysis: Democrats could have a contested convention"                                                                      
## [12] "How Pete Buttigieg rose to the top "                                                                                        
## [13] "What to watch in Friday night's debate"                                                                                     
## [14] "Key impeachment witness expects to leave White House post, source says"                                                     
## [15] "State to end a holiday for Robert E. Lee. Election Day will be a day off instead"                                           
## [16] "Harvey Weinstein's lawyer says she's never been sexually assaulted 'because I would never put myself in that position'"     
## [17] "Coronavirus cases skyrocket, with more than 31,000 cases worldwide"                                                         
## [18] "The couple accused of drugging and raping women are in a state of disbelief as charges are dropped"                         
## [19] "Teen killed while rapping on Facebook Live "                                                                                
## [20] "Ford shakes up its management"                                                                                              
## [21] "Hillary Clinton opens up about her marriage on 'Ellen'"                                                                     
## [22] "US economy added 225,000 jobs"                                                                                              
## [23] "Analysis: Best (and worst) of Iowa caucuses "                                                                               
## [24] "Credit Suisse CEO resigns after spy scandal"                                                                                
## [25] "6 takeaways from CNN's town halls "                                                                                         
## [26] "Veteran UK TV host comes out as gay during show"                                                                            
## [27] "Surfing champion dies at age 24"                                                                                            
## [28] "Gymnast shamed over video of daughter "                                                                                     
## [29] "Fox sneaks into Parliament and causes mayhem"                                                                               
## [30] "Baby carriers sold at Target and Amazon recalled "                                                                          
## [31] "'Wheel of Fortune' contestant's answer stuns judges"                                                                        
## [32] "Pablo Escobar's chief hitman 'Popeye' is dead"                                                                              
## [33] "Jimmy Kimmel skewers Don Jr.'s resume "                                                                                     
## [34] "Spotify, Apple Music and Amazon Music could soon hike prices. This company is betting on it"                                
## [35] "A single grain of moon dust has a lot to say"                                                                               
## [36] "How a beloved organic grocery chain collapsed"                                                                              
## [37] "Another Whole Foods competitor just bit the dust"                                                                           
## [38] "Macy's dealt a blow to the struggling American mall"                                                                        
## [39] "Sephora is opening 100 new stores"                                                                                          
## [40] "Forever 21 enters deal to sell for $81 million"                                                                             
## [41] "Slashing food stamps hurts the poor and stores"                                                                             
## [42] "Oscars predictions: The message a win by each nominee would send"                                                           
## [43] "The weird way Oscar votes are counted"                                                                                      
## [44] "Opinion: The Oscars' 'Harriet Tubman problem'"                                                                              
## [45] "Review: 'Birds of Prey' is bone-crunching mayhem"                                                                           
## [46] "The week in 47 photos"                                                                                                      
## [47] "The aviation museum for people who don't care about aviation"                                                               
## [48] "See moment Harlem Globetrotters reunite military family"                                                                    
## [49] "Britney Spears museum lets fans relive her most iconic moments"                                                             
## [50] "How Trump's three years of job gains compares to Obama's"                                                                   
## [51] "Things keep getting worse for WWE"                                                                                          
## [52] "The agony and ecstasy of being forced to give up our phones at Broadway shows and concerts"                                 
## [53] "Southwest is giving its employees 6 weeks of extra pay"                                                                     
## [54] "Democrats whine while Republicans govern"                                                                                   
## [55] "Trump's revenge on New York"                                                                                                
## [56] "Oscar winners, follow Joaquin Phoenix's lead"                                                                               
## [57] "Trump's desperate embrace of 'one trillion trees'"                                                                          
## [58] "Honoring Limbaugh was outrageous"                                                                                           
## [59] "Trump was not exonerated "                                                                                                  
## [60] "After 'American Dirt,' can I call myself Latinx?"                                                                           
## [61] "Hack your mortgage by refinancing to a 15yr fixed"                                                                          
## [62] "9 cards charging 0% interest until 2021"                                                                                    
## [63] "Apple's $7 trillion bet"                                                                                                    
## [64] "How drug wars led to a teen's murder and dismemberment"                                                                     
## [65] "Coronavirus fears lead to worldwide mask shortages"                                                                         
## [66] "Get up to speed on this year's Oscar nominees for best international feature film"                                          
## [67] "Moment astronaut lands back on Earth after 328 days in space"                                                               
## [68] "Former ATF agent at center of legal dispute over AR-15"                                                                     
## [69] "Las Vegas police find woman's body in suitcase "

There, finally.

A Pipeline Approach

Note that we had to call a series of functions in order to process this data and that naturally fits into the pipeline pattern. Thus, all of the above processing may be succinctly performed using the piping operator %>%:

library(magrittr)
cnn_titles <- cnn_xml %>%
    xml_find_all(xpath = title_xpath) %>%
    xml_contents %>%
    xml_text

We can repeat this for the NYT and BBC.

## NYT
nyt_url<- 'http://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml'
nyt_xml <- read_xml(x = nyt_url)
nyt_titles <- nyt_xml %>%
    xml_find_all(xpath = title_xpath) %>%
    xml_contents %>%
    xml_text

## BBC
bbc_url<- 'http://feeds.bbci.co.uk/news/rss.xml?edition=us'
bbc_xml <- read_xml(x = bbc_url)
bbc_titles <- bbc_xml %>%
    xml_find_all(xpath = title_xpath) %>%
    xml_contents %>%
    xml_text

So here is the answer to question 1:

cat("# CNN Stories:",
    length(cnn_titles),
    "# NYT Stories:",
    length(nyt_titles),
    "# BBC Stories:",
    length(bbc_titles),
    "\n")
## # CNN Stories: 69 # NYT Stories: 50 # BBC Stories: 42

14.3.4. What percent of stories relate to impeachment or the coronavirus?

This requires us to work with the story content and so we will get the descriptions first, following the same methods above.

cnn_descriptions <- cnn_xml %>%
    xml_find_all(xpath = description_xpath) %>%
    xml_contents %>%
    xml_text

nyt_descriptions <- nyt_xml %>%
    xml_find_all(xpath = description_xpath) %>%
    xml_contents %>%
    xml_text

bbc_descriptions <- bbc_xml %>%
    xml_find_all(xpath = description_xpath) %>%
    xml_contents %>%
    xml_text

We’ll assume that the words we are looking for are Trump, president, and a few related terms (impeachment, Schiff, coronavirus), although this approach isn’t perfect, since it will capture Ivanka Trump as well as the president of some other country.

This is a common enough task: looking for occurrences of a text pattern in a corpus of data. Our search is for very simple strings. By default, the R grep function returns the indices where a match occurs, while grepl returns TRUE for a match and FALSE otherwise. See below.
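A toy example (with made-up headlines) shows the difference between the two:

```r
## Made-up headlines for illustration
x <- c("President speaks", "Weather update", "Trump rally")

grep("trump|president", x, ignore.case = TRUE)   ## indices: 1 3
grepl("trump|president", x, ignore.case = TRUE)  ## TRUE FALSE TRUE
```

Summing the logical vector from grepl counts the matches, which is exactly what we do below.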

regex  <- "trump|president|impeachment|schiff|corona|virus"
cnn_mentions <- sum(grepl(regex, cnn_descriptions, ignore.case = TRUE))
nyt_mentions <- sum(grepl(regex, nyt_descriptions, ignore.case = TRUE))
bbc_mentions <- sum(grepl(regex, bbc_descriptions, ignore.case = TRUE))

cat("CNN mentions:",
    cnn_mentions/length(cnn_titles),
    "NYT mentions:",
    nyt_mentions/length(nyt_titles),
    "BBC mentions:",
    bbc_mentions/length(bbc_titles),
    "\n")
## CNN mentions: 0.2753623 NYT mentions: 0.22 BBC mentions: 0.07142857

14.3.5. Organize the data

Since the number of stories are different for the various sites, one obvious way is to create a separate data frame for each.

library(tibble)
cnn <- tibble(title = cnn_titles, description = cnn_descriptions)
nyt <- tibble(title = nyt_titles, description = nyt_descriptions)
bbc <- tibble(title = bbc_titles, description = bbc_descriptions)

However, it is obvious that this is unsatisfactory because the data is split across three different places. It would be better to aim for a single tidy data set.

A natural construct for a single data frame is to add a column for the site (CNN or NYT or BBC) in addition to the title and description columns as above.

site <- c(rep("CNN", length(cnn_titles)),
          rep("NYT", length(nyt_titles)),
          rep("BBC", length(bbc_titles)))
news <- tibble(site = site,
               title = c(cnn_titles, nyt_titles, bbc_titles),
               description = c(cnn_descriptions, nyt_descriptions, bbc_descriptions))
tibble::glimpse(news)
## Observations: 161
## Variables: 3
## $ site        <chr> "CNN", "CNN", "CNN", "CNN", "CNN", "CNN", "CNN", "CNN", "…
## $ title       <chr> "With a five-month impeachment saga behind them, sources …
## $ description <chr> "With a five-month impeachment saga behind them, White Ho…

14.3.6. Wordcloud

Word clouds are a popular way to describe topic areas that textual matter pertains to: the more frequent a term, the more prominently it is displayed in a word cloud.

For this purpose, we will make use of some packages: dplyr for processing data, tm for text mining, rvest for stripping HTML markup, and wordcloud for generating word clouds.

library(dplyr)
library(tm)
library(rvest)
library(wordcloud)

Our goal is to take all the descriptions from the CNN articles and combine them into one long string of words. Here are the steps.

The paste command in the last step concatenates the descriptions with a space (" ") in between.

library(dplyr)
cnn_words <- news %>%
    filter( site == "CNN") %>%
    select( description) %>%
    summarize(words = paste(description, collapse = " "))

Now we can just focus on the concatenated string:

cnn_words <- cnn_words$words

Removing HTML Markup

If you glance back at the news tibble, you see that there is a whole bunch of HTML markup and other crud in addition to the words. So we need to clean that up. We also want to remove some common words that are not of interest.

The rvest package provides a facility to remove HTML markup. For example,

library(rvest)
'Rvest, please <a href="http://foo.com">REMOVE STUFF AROUND ME</a>' %>%
    read_html %>%
    html_text
## [1] "Rvest, please REMOVE STUFF AROUND ME"

will remove the HTML markup. Note that this approach does not work with text that does not contain HTML! So we have to detect whether HTML is present: if it is, we remove it; otherwise, we do nothing.

This offers us an opportunity to write a small (simplistic) function that will do the job. We exploit the grepl function we saw earlier, which detects whether a pattern exists in a string.

stripHTMLIfPresent <- function(string) {
    if (grepl("<.*?>", string)) {
        html_text(read_html(string))
    } else {
        string
    }
}

Read the above function as follows: if string contains any number of characters between angle brackets, it contains HTML, so strip the HTML; otherwise just return it unmodified. This will do for us.
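Trying it on made-up strings with and without markup (the function is repeated here so the snippet stands alone):

```r
library(rvest)
library(xml2)

## Repeated from above so the snippet is self-contained
stripHTMLIfPresent <- function(string) {
    if (grepl("<.*?>", string)) {
        html_text(read_html(string))
    } else {
        string
    }
}

stripHTMLIfPresent('Rvest, please <a href="http://foo.com">REMOVE STUFF AROUND ME</a>')
## "Rvest, please REMOVE STUFF AROUND ME"
stripHTMLIfPresent('No markup here')
## "No markup here"
```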

Handling Internationalization

Next, many sites use internationalization, which means the text in the descriptions is not just plain ASCII, but includes some odd characters such as <U+201C>.

For example, the apostrophe in Trump's is often encoded as a Unicode character. This means that code that constructs word clouds can miscount the occurrences of words.

The easiest way to handle this problem is to convert the text from UTF-8 to ASCII, replacing every character that cannot be represented in ASCII with the empty string. We do that in the invocation of iconv below, where sub is set to the empty string.
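A quick base-R illustration, using a made-up string containing a Unicode right single quotation mark:

```r
## "\u2019" is the Unicode right single quotation mark
s <- "Trump\u2019s speech"
iconv(s, from = "UTF-8", to = "ASCII", sub = "")  ## "Trumps speech"
```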

Handling Punctuation

Next, the tm package provides a way of removing punctuation (removePunctuation) and some words not of interest, the so-called stop words, via the stopwords function. You can examine what these stop words are.

print(stopwords())
##   [1] "i"          "me"         "my"         "myself"     "we"        
##   [6] "our"        "ours"       "ourselves"  "you"        "your"      
##  [11] "yours"      "yourself"   "yourselves" "he"         "him"       
##  [16] "his"        "himself"    "she"        "her"        "hers"      
##  [21] "herself"    "it"         "its"        "itself"     "they"      
##  [26] "them"       "their"      "theirs"     "themselves" "what"      
##  [31] "which"      "who"        "whom"       "this"       "that"      
##  [36] "these"      "those"      "am"         "is"         "are"       
##  [41] "was"        "were"       "be"         "been"       "being"     
##  [46] "have"       "has"        "had"        "having"     "do"        
##  [51] "does"       "did"        "doing"      "would"      "should"    
##  [56] "could"      "ought"      "i'm"        "you're"     "he's"      
##  [61] "she's"      "it's"       "we're"      "they're"    "i've"      
##  [66] "you've"     "we've"      "they've"    "i'd"        "you'd"     
##  [71] "he'd"       "she'd"      "we'd"       "they'd"     "i'll"      
##  [76] "you'll"     "he'll"      "she'll"     "we'll"      "they'll"   
##  [81] "isn't"      "aren't"     "wasn't"     "weren't"    "hasn't"    
##  [86] "haven't"    "hadn't"     "doesn't"    "don't"      "didn't"    
##  [91] "won't"      "wouldn't"   "shan't"     "shouldn't"  "can't"     
##  [96] "cannot"     "couldn't"   "mustn't"    "let's"      "that's"    
## [101] "who's"      "what's"     "here's"     "there's"    "when's"    
## [106] "where's"    "why's"      "how's"      "a"          "an"        
## [111] "the"        "and"        "but"        "if"         "or"        
## [116] "because"    "as"         "until"      "while"      "of"        
## [121] "at"         "by"         "for"        "with"       "about"     
## [126] "against"    "between"    "into"       "through"    "during"    
## [131] "before"     "after"      "above"      "below"      "to"        
## [136] "from"       "up"         "down"       "in"         "out"       
## [141] "on"         "off"        "over"       "under"      "again"     
## [146] "further"    "then"       "once"       "here"       "there"     
## [151] "when"       "where"      "why"        "how"        "all"       
## [156] "any"        "both"       "each"       "few"        "more"      
## [161] "most"       "other"      "some"       "such"       "no"        
## [166] "nor"        "not"        "only"       "own"        "same"      
## [171] "so"         "than"       "too"        "very"

The call stopwords('SMART') provides a larger set of stop words, so it makes sense to use the union of the two sets to really clean up. Some sites also include the day of the week (Monday, Tuesday, etc.), so let’s include those as well.

our_stopwords <- union(stopwords(), stopwords('SMART'))
our_stopwords <- union(our_stopwords,
                       c("sunday", "monday", "tuesday", "wednesday",
                         "thursday", "friday", "saturday"))
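A quick toy check of what these tm helpers do to a made-up snippet:

```r
library(tm)

## removePunctuation strips punctuation characters
removePunctuation("Trump's speech, annotated!")   ## "Trumps speech annotated"

## removeWords blanks out the given words (their surrounding spaces remain)
removeWords("the speech of the year", stopwords())
```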

Also, to make things more compact in the pipelines below, we use the with function in R, which evaluates an expression within the context of a data frame: with(data.frame(x = 1, y = 2), x) will return the value of x, which is 1.
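That inline example, as runnable code:

```r
d <- data.frame(x = 1, y = 2)
with(d, x)      ## 1
with(d, x + y)  ## 3
```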

Finally, we can handle all three sites.

CNN

cnn_words <- news %>%
    filter( site == "CNN") %>%
    select( description) %>%
    summarize(words = paste(description, collapse = " ")) %>%
    with(words) %>%
    stripHTMLIfPresent %>%
    removePunctuation %>%
    tolower() %>%
    removeWords(our_stopwords) %>%
    iconv(from = "UTF-8", to = "ASCII", sub = "")
wordcloud(words = cnn_words, min.freq = 2, max.words = 25, random.order = FALSE)
## Warning in tm_map.SimpleCorpus(corpus, tm::removePunctuation): transformation
## drops documents
## Warning in tm_map.SimpleCorpus(corpus, function(x) tm::removeWords(x,
## tm::stopwords())): transformation drops documents

NYT

nyt_words <- news %>%
    filter( site == "NYT") %>%
    select( description) %>%
    summarize(words = paste(description, collapse = " ")) %>%
    with(words) %>%
    stripHTMLIfPresent %>%
    removePunctuation %>%
    tolower() %>%
    removeWords(our_stopwords) %>%
    iconv(from = "UTF-8", to = "ASCII", sub = "")
wordcloud(words = nyt_words, min.freq = 2, max.words = 25, random.order = FALSE)
## Warning in tm_map.SimpleCorpus(corpus, tm::removePunctuation): transformation
## drops documents
## Warning in tm_map.SimpleCorpus(corpus, function(x) tm::removeWords(x,
## tm::stopwords())): transformation drops documents

BBC

bbc_words <- news %>%
    filter( site == "BBC") %>%
    select( description) %>%
    summarize(words = paste(description, collapse = " ")) %>%
    with(words) %>%
    stripHTMLIfPresent %>%
    removePunctuation %>%
    tolower() %>%
    removeWords(our_stopwords) %>%
    iconv(from = "UTF-8", to = "ASCII", sub = "")
wordcloud(words = bbc_words, min.freq = 2, max.words = 25, random.order = FALSE)
## Warning in tm_map.SimpleCorpus(corpus, tm::removePunctuation): transformation
## drops documents
## Warning in tm_map.SimpleCorpus(corpus, function(x) tm::removeWords(x,
## tm::stopwords())): transformation drops documents

14.3.7. Better Colors

A package called RColorBrewer makes it very easy to generate color palettes for displaying diverging or sequential trends.

library(RColorBrewer)
palette <- brewer.pal(n = 9, name="Blues")

Our palette has 9 blue colors varying in intensity.

CNN

wordcloud(words = cnn_words, min.freq = 2, max.words = 25, random.order = FALSE, colors = palette)
## Warning in tm_map.SimpleCorpus(corpus, tm::removePunctuation): transformation
## drops documents
## Warning in tm_map.SimpleCorpus(corpus, function(x) tm::removeWords(x,
## tm::stopwords())): transformation drops documents

NYT

wordcloud(words = nyt_words, min.freq = 2, max.words = 25, random.order = FALSE, colors = palette)
## Warning in tm_map.SimpleCorpus(corpus, tm::removePunctuation): transformation
## drops documents
## Warning in tm_map.SimpleCorpus(corpus, function(x) tm::removeWords(x,
## tm::stopwords())): transformation drops documents

BBC

wordcloud(words = bbc_words, min.freq = 2, max.words = 25, random.order = FALSE, colors = palette)
## Warning in tm_map.SimpleCorpus(corpus, tm::removePunctuation): transformation
## drops documents
## Warning in tm_map.SimpleCorpus(corpus, function(x) tm::removeWords(x,
## tm::stopwords())): transformation drops documents

Side by side

Using our knowledge of the par command, let us plot these side by side.

opar <- par(mfrow=c(1, 3))
wordcloud(words = cnn_words, min.freq = 2, max.words = 25, random.order = FALSE, colors = palette)
## Warning in tm_map.SimpleCorpus(corpus, tm::removePunctuation): transformation
## drops documents
## Warning in tm_map.SimpleCorpus(corpus, function(x) tm::removeWords(x,
## tm::stopwords())): transformation drops documents
wordcloud(words = nyt_words, min.freq = 2, max.words = 25, random.order = FALSE, colors = palette)
## Warning in tm_map.SimpleCorpus(corpus, tm::removePunctuation): transformation
## drops documents

## Warning in tm_map.SimpleCorpus(corpus, tm::removePunctuation): transformation
## drops documents
## Warning in wordcloud(words = nyt_words, min.freq = 2, max.words = 25,
## random.order = FALSE, : workers could not be fit on page. It will not be
## plotted.
wordcloud(words = bbc_words, min.freq = 2, max.words = 25, random.order = FALSE, colors = palette)
## Warning in tm_map.SimpleCorpus(corpus, tm::removePunctuation): transformation
## drops documents
## Warning in tm_map.SimpleCorpus(corpus, function(x) tm::removeWords(x,
## tm::stopwords())): transformation drops documents

par(opar)

14.3.8. JavaScript Wordcloud

The package wordcloud2 offers a JavaScript word cloud that looks better. Let us try that.

CNN

library(wordcloud2)
library(stringr)
cnn_freq <- cnn_words %>%
    ## Split by words
    str_split(pattern = boundary("word")) %>%
    ## Generate a frequency table
    table %>%
    ## Convert to data frame, renaming columns
    as.data.frame(col.names = c("word", "freq"))
## Plot the word cloud in a browser
wordcloud2(cnn_freq)

NYT

nyt_freq <- nyt_words %>%
    ## Split by words
    str_split(pattern = boundary("word")) %>%
    ## Generate a frequency table
    table %>%
    ## Convert to data frame, renaming columns
    as.data.frame(col.names = c("word", "freq"))
## Plot the word cloud in a browser
wordcloud2(nyt_freq)

BBC

bbc_freq <- bbc_words %>%
    ## Split by words
    str_split(pattern = boundary("word")) %>%
    ## Generate a frequency table
    table %>%
    ## Convert to data frame, renaming columns
    as.data.frame(col.names = c("word", "freq"))
## Plot the word cloud in a browser
wordcloud2(bbc_freq)

14.3.9. Summary

We saw how XML processing can be done using the xml2 package. We also saw some details of XPath, a language used for picking off parts of an XML document via its Document Object Model (DOM).

Many organizations provide data in XML format. While verbose, it has the advantage of a reliable structure that can be computed against.

14.3.10. Session Info

sessionInfo()
## R version 3.6.2 (2019-12-12)
## Platform: x86_64-apple-darwin19.2.0 (64-bit)
## Running under: macOS Catalina 10.15.3
## 
## Matrix products: default
## BLAS/LAPACK: /usr/local/Cellar/openblas/0.3.7/lib/libopenblasp-r0.3.7.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] grid      stats     graphics  grDevices datasets  utils     methods  
## [8] base     
## 
## other attached packages:
##  [1] magrittr_1.5       tm_0.7-7           NLP_0.2-0          wordcloud2_0.2.1  
##  [5] wordcloud_2.6      RColorBrewer_1.1-2 rvest_0.3.5        xml2_1.2.2        
##  [9] forcats_0.4.0      stringr_1.4.0      dplyr_0.8.4        purrr_0.3.3       
## [13] readr_1.3.1        tidyr_1.0.2        tibble_2.1.3       ggplot2_3.2.1     
## [17] tidyverse_1.3.0    png_0.1-7          rmarkdown_2.1      knitr_1.28        
## [21] pkgdown_1.4.1      devtools_2.2.1     usethis_1.5.1     
## 
## loaded via a namespace (and not attached):
##  [1] httr_1.4.1        pkgload_1.0.2     jsonlite_1.6.1    modelr_0.1.5     
##  [5] assertthat_0.2.1  cellranger_1.1.0  slam_0.1-47       yaml_2.2.1       
##  [9] remotes_2.1.0     sessioninfo_1.1.1 pillar_1.4.3      backports_1.1.5  
## [13] lattice_0.20-38   glue_1.3.1        digest_0.6.23     colorspace_1.4-1 
## [17] htmltools_0.4.0   pkgconfig_2.0.3   broom_0.5.4       haven_2.2.0      
## [21] scales_1.1.0      processx_3.4.1    generics_0.0.2    ellipsis_0.3.0   
## [25] withr_2.1.2       lazyeval_0.2.2    cli_2.0.1         crayon_1.3.4     
## [29] readxl_1.3.1      memoise_1.1.0     evaluate_0.14     ps_1.3.0         
## [33] fs_1.3.1          fansi_0.4.1       nlme_3.1-144      MASS_7.3-51.5    
## [37] pkgbuild_1.0.6    tools_3.6.2       prettyunits_1.1.1 hms_0.5.3        
## [41] lifecycle_0.1.0   munsell_0.5.0     reprex_0.3.0      callr_3.4.1      
## [45] compiler_3.6.2    rlang_0.4.4       rstudioapi_0.10   htmlwidgets_1.5.1
## [49] testthat_2.3.1    gtable_0.3.0      curl_4.3          DBI_1.1.0        
## [53] R6_2.4.1          lubridate_1.7.4   utf8_1.1.4        rprojroot_1.3-2  
## [57] desc_1.2.0        stringi_1.4.5     parallel_3.6.2    Rcpp_1.0.3       
## [61] vctrs_0.2.2       dbplyr_1.4.2      tidyselect_1.0.0  xfun_0.12